Abstract

It is commonly accepted that machine translation is a more complex task
than part-of-speech tagging. But how much more complex? In this paper we
make an attempt to develop a general framework and methodology for
computing the informational and/or processing complexity of NLP
applications and tasks. We define a universal framework akin to a Turing
Machine that attempts to fit (most) NLP tasks into one paradigm. We
calculate the complexities of various NLP tasks using measures of Shannon
Entropy, and compare ‘simple’ ones such as part-of-speech tagging to
‘complex’ ones such as machine translation. This paper provides a first,
though far from perfect, attempt to quantify NLP tasks under a uniform
paradigm. We point out current deficiencies and suggest some avenues for
fruitful research.

1 Introduction 

The purpose of this paper is to suggest a unified framework in which
modern NLP research can quantitatively describe and compare NLP tasks.
Even though everyone agrees that some NLP tasks are more complex than
others, e.g., machine translation is ‘harder’ than syntactic parsing,
which in turn is ‘harder’ than part-of-speech tagging, we cannot compute
the relative complexities of different NLP tasks and subtasks.

In the typical current NLP paradigm, researchers apply several machine
learning algorithms to a problem, report on their performance levels, and
establish the winner as setting the level to beat in the future. We have
no single overall model of NLP that subsumes and regularizes its various
tasks. If you were to ask NLP researchers today they would say that no
such model is possible, and that NLP is a collection of several
semi-independent research directions that all focus on language and
mostly use machine learning techniques. Researchers will tell you that a
good summarization system on the DUC/TAC datasets obtains a ROUGE score
of 0.40, a good French-English translation system achieves a BLEU score
of 37.0, 20-news classifiers can achieve an accuracy of 0.85, and named
entity recognition systems a recall of 0.95, and these numbers are not
comparable. Further, we usually pay little attention to additional
important factors such as the performance curve with respect to the
amount of training data, the amount of preprocessing required, the size
and complexity of auxiliary information required, etc. And even when some
studies do report such numbers, in NLP we don’t know how to characterize
these aspects in general and across applications, or how to quantify them
in relation to each other.

We here describe our first attempt to develop a single generic high-level
model of NLP. We adopt the model of a universal machine, akin to a Turing
Machine but specific to the concerns of language processing, and show how
it can be instantiated in different ways for different applications. We
employ Shannon Entropy within the machine to measure the complexity of
each NLP task.

In his epoch-making work, Shannon (1951) demonstrated how to compute the
amount of information in a message. He considered the case in which a
string of input symbols is considered one by one, and the uncertainty of
the next is measured by counting how difficult it is to guess. We make
the fundamental assumption that most NLP tasks can be viewed as
transformations of notation, in which a stream of input symbols is
transformed and/or embellished into a stream of output symbols (for
example, POS tagging is the task of embellishing each symbol with its
tag, and MT is the task of outputting the appropriate translation
word(s)). Under this assumption one can ask: how much uncertainty is
there in making the embellishment or transformation? This clearly depends
on the precise nature of the task, on the associated auxiliary knowledge
resources, and on the actual algorithm employed. We discuss each of these
issues below. We first describe the key challenge involved in performing
uncertainty comparison using the Entropy measure in Section 2. In
Section 3 we provide high-level comments on what properties a framework
should have to enable fair complexity comparison. In Section 4, based on
the properties identified in Section 3, we consider the theoretical
nature of NLP tasks and provide suggestions for instantiating the
paradigm. The framework is described in Sections 5 and 6, and our results
are presented in Section 7. We point out current deficiencies and suggest
avenues for fruitful research in Section 8, followed by a conclusion.

2 The Dilemma for Shannon Entropy 

2.1 Review of Entropy and Cross Entropy 

Entropy, defined as $H(X) = -\sum_x P(x) \log P(x)$, measures the amount
of information contained in a message, and can be characterized as the
uncertainty of a random variable or process. For example, Shannon (1951)
reported an upper bound of 1.3 bits/character symbol for English
character prediction and 5.9 bits/word symbol for English word
prediction, meaning that it is highly likely that English word prediction
is a harder task than English character prediction.
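
As a minimal illustration of the definition above, the following Python sketch computes the entropy of a discrete distribution in bits. The function names and the toy distributions are illustrative assumptions and are unrelated to Shannon's experimental setup.

```python
import math
from collections import Counter


def shannon_entropy(probs):
    """Entropy in bits of a discrete distribution given as probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def empirical_entropy(symbols):
    """Entropy in bits/symbol of the empirical unigram distribution."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return shannon_entropy(c / total for c in counts.values())


# Toy distributions (made up for illustration).
uniform = shannon_entropy([0.25, 0.25, 0.25, 0.25])
skewed = shannon_entropy([0.7, 0.1, 0.1, 0.1])
```

A uniform distribution over four symbols is the hardest to guess and yields the largest entropy; any skew toward one symbol lowers it.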

If the output $Y^n = \{y_0, y_1, \ldots, y_n\}$ is a sequence generated
from the input and is a stationary stochastic process, then the entropy
of $Y$ is given by:

$$H(Y) = \lim_{n \to \infty} H(y_n \mid y_{n-1}, y_{n-2}, \ldots) \qquad (1)$$


By the Shannon-McMillan-Breiman theorem (Algoet and Cover, 1988) this can
be written as:

$$H(Y) = \lim_{n \to \infty} -\frac{1}{n} \log P(y_1, y_2, \ldots, y_n) \qquad (2)$$


So we can define its hardness or complexity by computing entropy from the
distribution $P(Y)$ for tasks like Shannon’s word prediction model, or
extend it to a noisy channel model (Shannon, 1948): given a sequence of
inputs $X$, the uncertainty of the output transformation is given by
$H(Y \mid X)$, interpreted as the amount of uncertainty remaining about
$Y$ when $X$ is already known.


The true distribution over Y is hard to estimate. Normally we estimate an
upper bound of the entropy, the cross entropy denoted as
$H(P, \hat{P})$, to approximate the true value of entropy:

$$H(P, \hat{P}) = H(P) + D_{KL}(P \,\|\, \hat{P}) \geq H(P) \qquad (3)$$

where $D_{KL}(P \,\|\, \hat{P})$ denotes the KL divergence between the
true distribution $P$ and the estimated distribution $\hat{P}$. A good
model will closely approximate $P$ using $\hat{P}$, leading to a smaller
value of $D_{KL}(P \,\|\, \hat{P})$, i.e., bringing the value of the
cross entropy closer to that of the real entropy. Different models would
obtain different values of $H(P, \hat{P})$. Various studies since
Shannon’s work (e.g., Kucera and Francis, 1967; Cover and King, 1978;
Gopinath and Cover, 1987; Brown et al., 1992) have explored methods to
lower the upper bound of character prediction entropy in English by
using more sophisticated models.
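
The decomposition in Eq. 3 can be checked numerically. The sketch below is a toy example under our own assumptions: the distributions `p`, `good`, and `poor` are made up, and the model distributions are assumed to assign non-zero probability wherever `p` does.

```python
import math


def entropy(p):
    """H(P) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """H(P, Q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    """D_KL(P || Q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.5, 0.25, 0.25]     # "true" distribution (toy)
good = [0.4, 0.3, 0.3]    # a model close to p
poor = [0.1, 0.1, 0.8]    # a model far from p

gap_good = cross_entropy(p, good) - entropy(p)
gap_poor = cross_entropy(p, poor) - entropy(p)
```

Both gaps equal the corresponding KL divergence and are non-negative, so every model yields an upper bound on the entropy, and the better model yields the tighter one.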


2.2 The Dilemma for Entropy 

While entropy describes the intrinsic nature of the problem or task, its
actual value estimation has to be determined by the specific model you
adopt for prediction. When Shannon approached the character prediction
task, his wife acted as the predictor. Alternatively, if Shannon had used
a child as predictor, he would have obtained a much larger estimated
entropy.

Similarly, if one wishes to compare the entropy of two tasks, for
example, to determine which language sequence is harder to predict,
English or French, it would be problematic if one compares the entropy
computed via a linguist for English and a child for French. One requires
twins who are mathematically and linguistically identical in terms of
English and French for a fair comparison (Cover and Thomas, 2012).
However, in the real world, it is almost impossible to find such twins.
Different models are attuned differently to different scenarios, tasks,
datasets, evaluation metrics, parameter settings, or optimization
strategies. One model might not fit all tasks equally well, e.g., SVMs
are not designed to predict probabilities, CRFs offer more insights in
sequence labeling tasks than SVMs but are hard to use straightforwardly
for text classification, etc.

In summary, though entropy provides a theoretical definition of the
uncertainty of a data source or task, the fact that its estimation must
be performed using a real specific model poses a dilemma for the accurate
estimation of the uncertainty of tasks and hence for their fair
comparison.

3 Prerequisites for Fair Comparison 

We claim that a framework should incorporate the 
following elements to enable a fair complexity 
comparison of disparate NLP tasks and systems: 

A universal measure: Complexity can be measured in terms of multiple
aspects (e.g., the amount of training data required, the amount of
preprocessing required, the size and complexity of auxiliary information,
training time, memory usage, or even lines of code). But we need a
universal and appropriate metric. In this work, we propose Shannon
Entropy as the universal metric, which we believe reflects the intrinsic
randomness, predictability, and uncertainty of datasets and tasks. All
the above aspects are highly correlated with Shannon Entropy.

A universal Engine: A POS tagging system makes decisions by selecting
tags with highest probability while a summarization system selects the
top-ranked sentences. A fair comparison of complexity, however, requires
a single general and unified engine to define all (or at least most) NLP
tasks within the same framework. The abovementioned notation
transformation paradigm, elaborated in the following section,
accommodates most NLP tasks.

A universal model: We cannot fairly compare the entropy obtained from a
logistic regression model on POS tags to that produced from a large
framework of interdependent alignment, phrase extraction, and decoding
algorithms for machine translation. A unified model should work with
predictions for all (or at least most) current NLP tasks, and should make
relatively accurate predictions (a random guess model, for example, is
general but would not be helpful). We propose in Section 6 a candidate
model.

4 The Nature of NLP 

In order to propose a single multi-purpose unified Engine for NLP one has
to adopt a very general perspective. When constructing an NL system one
typically assembles a variety of components. Some of them are active
modules that instantiate algorithms and perform transformations. Others
are passive resources (like lexicons, probability tables, or rulesets)
that support the former. Active modules are sometimes built to produce
passive ones. It is important to differentiate the role of modules in a
framework in order to properly estimate the overall complexity. In this
section we first categorize the primary roles of NLP (sub)systems and
then postulate that modern NLP algorithms (largely) fall into three
distinct complexity types.

4.1 Three Classes of NLP System 

NLP systems generally perform one of the following three functions/roles:
(i) research into aspects of the nature of language(s), (ii) application
tasks and subtasks, and (iii) support algorithms. The majority of NLP
development today falls into the second and third classes.

Research into language includes such studies as determining the Zipfian
nature and the entropy of language, discovering changes in patterns of
use over time and across geographic regions, identifying text genres by
for example creating word and constituent distribution profiles, and so
on.

Application tasks include machine translation, information retrieval,
speech recognition, natural language interpretation (both syntactic
parsing and semantic analysis), information extraction, question
answering, dialogue processing, text summarization, text (sentence and
multi-sentence) generation, sentiment analysis / opinion mining, text
mining / harvesting, and others. Subtasks include part of speech tagging,
chunking, coreference resolution, text segmentation, query analysis,
bitext alignment, reference generation, profiling/characterization of
language producers, and many others, as well as numerous resource
creation tasks including building monolingual and bilingual lexicons,
distributional semantic word profiles and embeddings, word sense lists,
ontologies / taxonomies, word-sentiment lists, and many others.

Support algorithms include a variety of generic procedures that are
reused in many applications. In addition to the classic Finite State /
Augmented Transition Network technology from the 1960s and later, modern
NLP usually works with the statistical properties of large collections of
words, and modern support algorithms such as HMMs and others generally
assign and use count-derived scores to [sets of] words, such as tf.idf,
PMI, and others, or distribute probability across sets of labels, words
or documents, such as Expectation Maximization, Probabilistic Graphical
Models, Topic Modeling algorithms, certain clustering algorithms, etc.
Some support algorithms focus on processing human labeling (annotation)
and comparing the results of various different labeling agents (human and
machine), such as Jensen-Shannon and other distribution comparison
scoring, annotation optimization procedures, etc.

4.2 Three Levels of NLP Algorithm 

We postulate that (almost every) NLP task / subtask can be defined as [a
combination of] one of three basic operations, listed in order of
complexity:

Level 1: Prediction: The algorithm reads its 
input, which includes principally a sequence of 
units of some kind, and predicts the next item in 
the sequence. Example: predicting the next word 
in a stream, as used by Shannon to calculate the 
information content of text. 

Level 2: Labeling: The algorithm reads its input and generates label(s)
based on it. Labeling tasks can be divided into two subcategories:

• Aligned Labeling: there is a one-to-one correspondence between inputs
and outputs. Aligned tasks include most tagging tasks (e.g., named
entity tagging and part-of-speech tagging).

• Unaligned Labeling: no aligned correspondence exists between inputs
and outputs. Unaligned labeling tasks can be further divided into
single-label tasks such as categorization or clustering, in which a
single label is assigned given the input (e.g., classification of
documents each into one class), and sequence-label tasks, in which a
sequence of labels is produced (e.g., MT, where the labels are target
language words).

Level 3: Scoring: The algorithm reads its input and assigns a score
(without loss of generality, a real value between zero and unity) to some
unit(s) in it. The score may be a probability, rating, or some other
score. Example: tf.idf scoring of words.

In a probabilistic paradigm, the probability of Level 1 tasks can be
characterized as $P(Y)$ where $Y$ denotes the sequence to predict. Level
2 tasks can be characterized as $P(Y \mid X)$ where $X$ denotes the input
and $Y$ denotes the label(s) to generate.


Often, one operation is used to perform another. It is typical in
modern-day (post-1990s) NLP to perform all kinds of labeling (Level 2
operations) by scoring all relevant possible categories (a Level 3
operation) and then returning the highest-scoring one as the selected
tag. This contrasts with pre-1990s NLP that generally computed a single
result, such as the desired label, as the one and only possible answer.

A task may require several operations in sequence. For example, syntactic
parsing requires labeling the part of speech tag of each word, labeling
the start and end words of syntactic constituents, labeling the head of
each constituent, and labeling the syntactic role of each constituent
with regard to its immediate head. Sometimes the label is drawn from a
small set of possible tags that is predefined by theorists or the
researcher, such as the part of speech tags. Sometimes the label is
provided in the text, such as the head word of a syntactic constituent.
Sometimes the label is a value computed by a scoring operation, such as
the PMI score of a word pair in a corpus.

5 The Universal NLP Engine 

The generic NLP Engine contains (see Figure 1): 

The transformation engine E, which takes as 
input one or more symbols from S and produces 
zero or more labels in response. 

The input stream S, which contains the text (without loss of generality,
we talk about text (a sequence of words and punctuation), but S might
instead be a sequence of symbols from some other vocabulary, such as part
of speech tags, or a mixture of several vocabularies, such as words with
their individual part of speech tags). We therefore consider S as
consisting of an essentially infinite stream of units, each unit being a
symbol (or set of associated symbols) for which a label (or set of
labels) is to be computed by E. Let $\mathcal{X}$ denote the set of
source symbols.

The data resource(s) R, typically a lexicon, a grammar, a probability
model, or the output of some subtask, used by E to perform its
transformation.

The output label(s) Y, a set of predefined symbols that E produces. Let
$\mathcal{Y}$ denote the set of target symbols (including labels). We
have $\forall y \in Y,\ y \in \mathcal{Y}$, and also possibly
$\mathcal{X} \subseteq \mathcal{Y}$.

We next describe a generic procedure for implementing the machine based
on the following assumption.

Assumption 0.1 [Most] modern NLP tasks can be viewed as predicting a
(sequence of) token(s) (i.e., $Y^n$) using a finite-state Turing Machine.

Such a procedure allows one to measure and compare various aspects of
[almost] any NLP task and subtask in a systematic way, and to thereby
compare the computational properties of alternative approaches and
implementations to any NLP (sub)task.

The following examples, using the same input stream $X^n$ = “Dog eats
apple”, illustrate how the engine works by phrasing several modern NLP
tasks as sequential token prediction problems:

• Sentiment Classification:

$\mathcal{Y}$ = {“-1”, “0”, “1”}, respectively for negative, neutral,
and positive sentiment
$Y^n$ = “0” (meaning: neutral sentiment).

• POS tagging:

$\mathcal{Y}$ = {Penn Treebank POS Tags}
$Y^n$ = “NNP VBZ NN”.

• Syntactic Parsing:

$\mathcal{Y}$ = {“(ROOT”, “(S”, “(NP”, “)”, ...}
$Y^n$ = (ROOT (S (NP (NNP)) (VP (VBZ) (NP (NN))))).

• Semantic analysis:

$\mathcal{Y}$ = {English Word List, Relation List, ...}
$Y^n$ = “∃e. eat(e) ∧ agent(e, dog) ∧ patient(e, apple)”.

• Word Sense Disambiguation:

$\mathcal{Y}$ = {“1”, “2”, “3”, “4”, ...}
$Y^n$ = “1 3 1”, corresponding to the 1st, 3rd, and 1st senses of the
respective tokens.

• Machine Translation:

$\mathcal{Y}$ = {French Words, Punctuation}
$Y^n$ = “chien mange pomme”.

• Summarization:

$\mathcal{Y}$ = {English Words, Punctuation}
$Y^n$ = “dog eats apple” (the gold-standard summary is the original
sentence).

The proposed framework is inspired by recent progress of
sequence-to-sequence prediction models in NLP, such as machine
translation (Sutskever et al., 2014; Bahdanau et al., 2014; Vinyals et
al., 2014; Cho et al., 2014; Graves et al., 2014) and the work of
Andreas et al. (2013) that illustrates that semantic parsing can to some
extent be viewed as a machine translation problem. Treating syntactic
parsing as a string prediction task is defined and implemented in
(Vinyals et al., 2014).

6 The Universal Entropy Model

In this section, we discuss the generic model for computing Entropy
within a Language Engine.

6.1 Requirements

We first identify requirements for the universal model. The random guess
model satisfies the generality property as it can be used in any
predictive model. But it of course does not constitute a good predictive
model, since it delivers estimated distributions far away from the actual
distribution. In contrast, n-order Markov models (n = 1, 2, 3, ...) seem
to serve the purpose well and can be easily applied in sequence-labeling
tasks such as NER and POS tagging. However, it is tricky to adapt them to
single-label predictions such as text classification. Additionally, their
dependence on specific feature selections makes fair comparison across
different implementations complicated or impossible.

This consideration leads to the first requirement for a model:

Requirement 0.1 The model should be able to leverage different types of
features automatically to avoid infinitely complicated feature
engineering procedures.

We consider $P(Y^n)$ and $P(Y^n \mid X^n)$ to gain insights about better
predictions for entropy calculation. We use $P(Y^n)$ for illustration as
it can be easily extended to $P(Y^n \mid X^n)$. In the sequence
prediction task, let $F_n$ denote the entropy where predictions are made
given the previous n tokens (an n-order Markov model). As proved by
Shannon (Shannon, 1951), $F_n$ monotonically decreases with respect to n
and $H(Y^n)$ is bounded from above by $F_n$:

$$F_1 \geq F_2 \geq \cdots \geq F_\infty \geq H(Y^n) \qquad (4)$$

Taking as example an n-gram word prediction model, the theoretically
estimated entropy decreases as predictions are made based on increasingly
many preceding tokens, as roughly stated in (Shannon, 1951). However,
issues arise when n is too large to maintain an n-gram probability table.

Based on Shannon’s proof, in order to approximate the real entropy as
closely as possible using the estimated entropy, we generalize the second
property of the model as follows:

Requirement 0.2 The model should be able to memorize earlier information
as much as possible given the computing power and storage capacity
available.

These thoughts are inspired by recent progress of sequence-to-sequence
generation models (Sutskever et al., 2014; Bahdanau et al., 2014;
Vinyals et al., 2014; Cho et al., 2014).
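
The effect of longer contexts on estimated entropy can be observed on a toy corpus. The sketch below is our own illustration, not the procedure used in this paper: it computes the in-sample conditional entropy of the next character given the previous `order` characters, and the toy text and function name are assumptions. All orders are evaluated on the same positions so that the estimates are comparable.

```python
import math
from collections import Counter


def conditional_entropy(tokens, order, max_order):
    """Empirical H(next token | previous `order` tokens) in bits.

    Only positions t >= max_order are used, so that every order is
    estimated on exactly the same set of positions.
    """
    joint = Counter()
    context = Counter()
    for t in range(max_order, len(tokens)):
        ctx = tuple(tokens[t - order:t])  # empty tuple when order == 0
        joint[(ctx, tokens[t])] += 1
        context[ctx] += 1
    total = sum(context.values())
    return -sum(
        (count / total) * math.log2(count / context[ctx])
        for (ctx, _), count in joint.items()
    )


# Toy character stream (made up for illustration).
text = "the dog eats the apple and the dog eats the bone " * 20
tokens = list(text)
estimates = [conditional_entropy(tokens, k, max_order=3) for k in range(4)]
```

On this sample the estimates are non-increasing in the context length. Note that these are in-sample estimates: with long contexts and limited data the count table becomes sparse and the estimate becomes optimistic, which is one face of the problem of maintaining large n-gram tables.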


6.2 Model 

This line of thinking suggests using as model recurrent neural networks
(Mikolov et al., 2010; Funahashi and Nakamura, 1993) or sophisticated
versions like LSTM (Hochreiter and Schmidhuber, 1997) (please see the
Appendix for details about LSTM models).

Recurrent networks obtain a fixed-sized vector for each step within the
processing sequence by combining current information with output from
earlier step(s). Such vectors can be viewed as combining evidence
obtained so far, and are used to predict the subsequent token(s),
typically using a softmax function. For labeling tasks, recurrent neural
networks first map the input $X^n$ of arbitrary length to a fixed-sized
vector, which can be viewed as evidence, and then map that vector to the
output by combining feature representations at each step.

Recurrent models have the following merits:

(1) They satisfy requirement 1 by automatically encoding “features” in
the real-valued representation vectors without explicit feature selection
and engineering. Though the models still require significant parameter
tuning, they provide a relatively unified procedure for comparison.

(2) By sequentially combining each token with output from earlier step(s)
they have the ability to ‘remember’ information required to approximate
(to some degree) the conditional probability
$\lim_{n \to \infty} P(y_t \mid y_{t-1}, y_{t-2}, \ldots, y_{t-n})$,
which partially addresses requirement 2.

(3) The model is manageable since it uses constant memory size and runs
in linear time.

We explicitly do not claim that recurrent neural models are a perfect
choice as model in an NLP Engine. We acknowledge their numerous
shortcomings, and discuss some pros and cons in the concluding section.
However, we believe that they do offer advantages over other models we
have considered with regard to tradeoffs of generality, computing power,
and storage capacity.


7 Experiments: Comparing the Uncertainty of NLP Tasks

Using the above framework, we now estimate the entropy for a few NLP
tasks. What must never be overlooked however is the impact of the
training/testing datasets used (e.g., the complexity for guessing
subsequent words in novels and newspapers can be different) and how
exactly the task is defined (e.g., differences of complexity in sentiment
classification between a 5-class and 2-class problem are huge).


7.1 Tasks and Datasets 

Prediction Tasks: We use the Wikipedia 2014 corpus, divided half and half
for training and testing. We employ the most frequent 200,000 words and
add an “unknown” symbol to represent the remainder, making it a
200,001-class prediction problem. This is a simple Prediction task.

Sentiment Analysis The dataset of Pang et al. (2002) comprises sentences
containing gold-standard sentiment labels tagged at the start of each
sentence. We divide the original dataset into training (8101) / dev (500)
/ testing (2000). This is an Unaligned Labeling (single) task.

Question-Answering (UMD) The dataset comprises two domains, History and
Literature, and contains roughly 2,000 questions where each question is
paired with an answer (Iyyer et al., 2014). Since answers are selected
from a pool of roughly 100 answer candidates, this is not an open QA
problem but a multi-class classification problem; i.e., an Unaligned
Labeling (single) task.


Machine Translation We use the WMT14 
English-French dataset and the OpenMT12 
English-Chinese dataset. This is an Unaligned 
Labeling (sequence) task. 

Part-of-Speech Tagging (Penn Treebank) We use a random sample of Wiki2014
as training and testing datasets, each containing 1 million sentences.
Gold-standard labels are assigned using the Stanford POS tagger. This is
an Aligned Labeling task.

Named Entity Recognition (CoNLL) We use the CoNLL-2003 English benchmark
for training, which labels four entity types (person, location,
organization, miscellaneous). The models are tested on the CoNLL-2003
testing data. This is an Aligned Labeling task.

Syntactic Parsing Training data is the OntoNotes corpus (Hovy et al.,
2006) and the English Web Treebank (Petrov and McDonald, 2012) with an
additional 5 million random sentences, all parsed by the Stanford Parser
(Socher et al., 2013). The testing dataset is Section 22 of the Penn
Treebank plus 1000 sentences from the Question Treebank. We followed the
protocols defined in (Vinyals et al., 2014). This is an Unaligned
Labeling (sequence) task.

Question Answering (Open-domain) We use the Yahoo Comprehensive QA
dataset. The dataset comprises roughly 4 million QA pairs. Questions and
answers are sequences of tokens. Questions are treated as inputs and
models predict word sequences as responsive answers. This is an Unaligned
Labeling (sequence) task.


7.2 Implementations 
7.2.1 Prediction Task 

Implementations for prediction tasks, where $P(Y^n)$ is to be estimated,
are similar to recurrent language models as defined in (Mikolov et al.,
2010). Let $e_{t-1}$ denote the representation obtained for timestep
$t-1$ based on preceding information from the LSTM. Let $e_{y_t}$ denote
the feature representation for the token to be predicted at time $t$. By
adopting a softmax function, the conditional probability for the
occurrence of the current token given earlier evidence is given by:

$$p(y_t \mid y_{t-1}, \ldots, y_{t-n}) = \frac{f(e_{t-1}, e_{y_t})}{\sum_{y \in \mathcal{Y}} f(e_{t-1}, e_y)} \qquad (5)$$

where $f(e_{t-1}, e_{y_t})$ denotes the compositional function between
vectors $e_{t-1}$ and $e_{y_t}$. In this paper, we adopt the form of an
exponential dot product for $f(\cdot)$:

$$f(e_{t-1}, e_{y_t}) = \exp(e_{t-1} \cdot e_{y_t}) \qquad (6)$$
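
Eqs. 5 and 6 together amount to a softmax over dot products. The following sketch shows that computation in plain Python; the two-dimensional embeddings, the three-word vocabulary, and the function name are made-up placeholders, whereas in the actual system $e_{t-1}$ would come from the LSTM.

```python
import math


def next_token_distribution(e_prev, token_embeddings):
    """Softmax over exp(e_prev . e_y) for every candidate token y."""
    scores = {
        y: sum(a * b for a, b in zip(e_prev, e_y))
        for y, e_y in token_embeddings.items()
    }
    # Subtracting the maximum score leaves the ratio in Eq. 5 unchanged
    # and avoids overflow in exp().
    top = max(scores.values())
    exp_scores = {y: math.exp(s - top) for y, s in scores.items()}
    z = sum(exp_scores.values())
    return {y: v / z for y, v in exp_scores.items()}


# Hypothetical history vector and token embeddings.
e_prev = [1.0, 0.0]
embeddings = {
    "eats": [2.0, 0.0],
    "dog": [0.0, 1.0],
    "apple": [-1.0, 0.5],
}
dist = next_token_distribution(e_prev, embeddings)
```

The token whose embedding has the largest dot product with the history vector receives the highest probability, here "eats".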


7.2.2 Labeling Task 

We follow the frameworks of (Sutskever et al., 2014; Bahdanau et al.,
2014; Vinyals et al., 2014) by first concatenating input and output:
$\{X^n, Y^n\} = \{x_1, \ldots, x_n, y_1, \ldots, y_n\}$. Let $e_{t-1}$
denote the LSTM output at timestep $t-1$, obtained by combining all
tokens preceding $t$ in $\{X^n, Y^n\}$, i.e.,
$\{x_1, \ldots, x_n, y_1, \ldots, y_{t-1}\}$.

Unaligned Single Labeling Single-tag labeling corresponds to the special
case where the size of $Y^n$ is 1. Taking Sentiment Analysis as an
example, sentence-level embeddings (denoted as $e_n$, where $n$ denotes
the length of the current sentence) are first obtained recurrently from
the LSTM. As it is a binary classification problem, we have:

$$P(y \mid \cdot) = \frac{\exp(e_n \cdot e_y)}{\sum_{y' \in \{0, 1\}} \exp(e_n \cdot e_{y'})} \qquad (7)$$

Question-Answering (UMD) is implemented in a similar way.


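Unaligned Sequence Labeling Following (Bahdanau et al., 2014; Vinyals et
al., 2014), the conditional probability of the output sequence is the
product of the conditional probabilities for predicting each current
token $y_t$ in $Y^n$:

$$P(Y^n \mid X^n) = \prod_{1 \leq t \leq n} P(y_t \mid x_1, \ldots, x_n, y_1, \ldots, y_{t-1}) = \prod_{1 \leq t \leq n} \frac{f(e_{t-1}, e_{y_t})}{\sum_{y \in \mathcal{Y}} f(e_{t-1}, e_y)} \qquad (8)$$

$f(\cdot)$ takes the same form as in Eq. 6.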
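
Given per-token probabilities as in Eq. 8, an average entropy can be obtained in the spirit of Eq. 2 by averaging negative log-probabilities over all predicted tokens. The sketch below shows one plausible way to do this; the paper does not spell out the normalization or the logarithm base, so the per-token averaging, the use of bits, and the example probabilities are all assumptions.

```python
import math


def sequence_log2_prob(token_probs):
    """log2 P(Y^n | X^n) as the sum of per-token log-probabilities."""
    return sum(math.log2(p) for p in token_probs)


def average_entropy(sequences):
    """Average negative log2-probability per predicted token."""
    total_log_prob = sum(sequence_log2_prob(s) for s in sequences)
    total_tokens = sum(len(s) for s in sequences)
    return -total_log_prob / total_tokens


# Hypothetical probabilities a model assigns to the gold output tokens
# of two test instances.
model_probs = [[0.5, 0.5], [0.5, 0.5, 0.5]]
bits_per_token = average_entropy(model_probs)
```

A model that assigns probability 0.5 to every gold token yields exactly one bit per token; more confident and correct predictions drive the value toward zero.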


Aligned Sequence Labeling In aligned sequence labeling tasks, there is a
one-to-one correspondence between output $y_t$ and input $x_t$, which
should be captured in the model. Decisions at timestep $t$ are made by
combining the LSTM representation $e_{t-1}$ and the input representation
$e_{x_t}$:

$$P(Y^n \mid X^n) = \prod_{1 \leq t \leq n} P(y_t \mid x_1, \ldots, x_n, y_1, \ldots, y_{t-1}) = \prod_{1 \leq t \leq n} \frac{f(e_{t-1}, e_{y_t}, e_{x_t})}{\sum_{y \in \mathcal{Y}} f(e_{t-1}, e_y, e_{x_t})} \qquad (9)$$

where $f(e_{t-1}, e_{y_t}, e_{x_t})$ is given as below:

$$f(e_{t-1}, e_{y_t}, e_{x_t}) = \exp(U \cdot (W \cdot [e_{t-1}, e_{y_t}, e_{x_t}])) \qquad (10)$$

where $[e_{t-1}, e_{y_t}, e_{x_t}]$ denotes the concatenation of the
three vectors, and $W$ and $U$ denote a matrix and a vector that project
the concatenated vector to a scalar. Taking POS tagging as an example,
for the sentence $X^n$ = “dog eats bones” with corresponding labels
$Y^n$ = “NN VBZ NNS”, we first concatenate $X^n$ with $Y^n$: “dog eats
bones NN VBZ NNS”. When making predictions at token “VBZ”, let
$e_{LSTM}$ denote the LSTM embedding computed at the preceding token
“NN”, $e_{VBZ}$ denote the embedding for token “VBZ”, and $e_{eats}$
denote the corresponding input embedding for token “eats”. Then the
probability for generating the part-of-speech tag VBZ is given by:

$$p(\mathrm{VBZ} \mid \cdot) = \frac{f(e_{LSTM}, e_{VBZ}, e_{eats})}{\sum_{y \in \mathcal{Y}} f(e_{LSTM}, e_y, e_{eats})} \qquad (11)$$

Task                            Avg Entropy
Word Prediction (Wiki)          7.12
English-Chinese Translation     5.17
English-French Translation      3.92
QA (Open-domain)                3.87
Syntactic Parsing               1.18
QA (UMD)                        1.08
Text Classification (20 news)   0.70
Sentiment (Pang)                0.58
Part-of-Speech Tagging          0.42
Named Entity Recognition        0.31

Table 1: Average entropy for different NLP tasks with the corresponding
dataset specified.
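
The aligned scoring function of Eqs. 9 and 10 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the one-dimensional embeddings, the values of `W` and `U`, and the function names are invented for the example and are not the trained parameters of the system.

```python
import math


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def aligned_tag_distribution(e_prev, e_x, tag_embeddings, W, U):
    """Normalize exp(U . (W . [e_prev, e_y, e_x])) over all tags y."""
    scores = {}
    for tag, e_y in tag_embeddings.items():
        concat = e_prev + e_y + e_x          # list concatenation
        hidden = matvec(W, concat)
        scores[tag] = sum(u * h for u, h in zip(U, hidden))
    top = max(scores.values())               # for numerical stability
    exp_scores = {t: math.exp(s - top) for t, s in scores.items()}
    z = sum(exp_scores.values())
    return {t: v / z for t, v in exp_scores.items()}


# Hypothetical one-dimensional embeddings and projection parameters.
e_lstm = [0.5]                               # history after tag "NN"
e_eats = [1.0]                               # aligned input token "eats"
tags = {"NN": [0.0], "VBZ": [1.0], "NNS": [-1.0]}
W = [[0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
U = [2.0, 0.1]
dist = aligned_tag_distribution(e_lstm, e_eats, tags, W, U)
```

Compared with the unaligned case, the only change is that the embedding of the aligned input token enters the score alongside the history vector and the candidate tag.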


7.3 Details 


For each task, word embeddings are initialized to the same pre-trained
vectors for fairness. Pre-trained embeddings were obtained from word2vec
on a 6-billion-word corpus with dimensionality 512. LSTM models are
composed of one single hidden layer. Stochastic gradient descent (without
momentum) with mini-batches (Cotter et al., 2011) is adopted. For each
task, we use an initial learning rate of 0.5 with a linear decay.
Learning stops after 4 iterations. We initialized the LSTM parameters
using a uniform distribution over [-0.1, 0.1]. Following (Sutskever et
al., 2014), the gradient is normalized if its norm exceeds a threshold to
avoid exploding gradients. For unaligned sequence prediction tasks (i.e.,
syntactic parsing, QA (open domain)), inputs are reversed, as suggested
in (Sutskever et al., 2014).
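
Two of the training details above, the linearly decaying learning rate and the gradient normalization, can be sketched as follows. The exact decay schedule and the clipping threshold are not given in the text, so the per-iteration schedule and the threshold value used below are assumptions.

```python
import math


def linear_decay(initial_rate, step, total_steps):
    """Learning rate that falls linearly from initial_rate to zero."""
    return initial_rate * max(0.0, 1.0 - step / total_steps)


def clip_by_norm(gradient, threshold):
    """Rescale the gradient so that its L2 norm is at most threshold."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= threshold:
        return list(gradient)
    scale = threshold / norm
    return [g * scale for g in gradient]


# Initial rate 0.5 and 4 iterations as stated in the text; decaying once
# per iteration is an assumed schedule.
rates = [linear_decay(0.5, step, 4) for step in range(4)]

# Hypothetical threshold of 1.0 applied to a toy gradient of norm 5.
clipped = clip_by_norm([3.0, 4.0], threshold=1.0)
```

Clipping by norm keeps the direction of the update and only shrinks its length, which is why it is a common remedy for exploding gradients in recurrent models.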


7.4 Results 

Estimated entropies for different tasks computed in the proposed paradigm
are presented in Table 1. As can be seen, MT is less complex than word
prediction tasks, which is in line with our expectation: for MT, output
tokens are predicated on source tokens. The input data provides
additional information and lowers the degree of uncertainty:
$H(Y \mid X) \leq H(Y)$ for any X and Y.
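
That conditioning on the input cannot increase uncertainty can be verified on a small joint distribution. The sketch below uses a made-up joint distribution over a source word and a target word; the probabilities and names are purely illustrative.

```python
import math


def entropy(probs):
    """Entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def marginal_and_conditional_entropy(joint):
    """Return (H(Y), H(Y|X)) for joint[x][y] = P(X = x, Y = y)."""
    ys = {y for row in joint.values() for y in row}
    p_y = [sum(row.get(y, 0.0) for row in joint.values()) for y in ys]
    h_y = entropy(p_y)
    h_y_given_x = 0.0
    for row in joint.values():
        p_x = sum(row.values())
        if p_x > 0:
            h_y_given_x += p_x * entropy(p / p_x for p in row.values())
    return h_y, h_y_given_x


# Toy joint distribution of an English source word X and a French
# target word Y (numbers invented for illustration).
joint = {
    "dog": {"chien": 0.45, "chat": 0.05},
    "cat": {"chien": 0.05, "chat": 0.45},
}
h_y, h_y_given_x = marginal_and_conditional_entropy(joint)
```

Without the source word the target is a coin flip (one bit); once the source word is known, less than half a bit of uncertainty remains.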

As discussed earlier, estimated entropies depend on the datasets used.
Because training data is significantly limited, a high level of entropy
is observed for summarization. This phenomenon demonstrates one key
disadvantage of the proposed model: the failure to consider the impact of
datasets. In particular, we are computing the upper bound for a specific
task given the specific dataset adopted. How to take into account the
influence of different datasets (e.g., amounts of training data, quality
of training data) poses a great challenge to developing a general NLP
Engine.

8 Deficiencies and Directions for 
Improvement 

We have proposed a paradigm with three
requirements, which we believe to be essential
for a universal NLP engine. We are fully aware
that it is impossible to come up with
instantiations that perfectly meet all the
requirements using current algorithms and
frameworks. We consider the search for optimal
solutions to be a long-term task. In this section
we identify deficiencies in the proposed
framework and suggest avenues for improvement.

The Metric: We proposed to use Shannon Entropy
as the uncertainty measure for evaluating the
complexity of tasks because we believe that
entropy reflects the nature of uncertainty more
deeply than other current measures such as
accuracy or recall. However, if a theoretical
computer scientist were to develop a better
measure that avoids the dilemma described in
Section 2, we would replace entropy with that
measure.

The Engine: In this paper, we use an end-to-end
Turing string prediction engine, which says
nothing substantive about the complexity of
resources and intermediate procedures. This could
be problematic. Consider the following scenarios:
in case 1 we have a long table that lists each
possible input together with its output, so that
answering is a simple lookup and the work goes
into creating the table; in case 2 we have a
small resource of rules but a lot of feature
creation and rule application in the main engine
to perform the same task. The entropy from input
to output is then the same if the two systems
produce exactly the same output (though one
perhaps takes a lot more time and the other
perhaps requires more space). But is the amount
of work required (and hence the entropy effect)
to create the two resources the same? In other
words, can one argue that because the ‘outside’
end-to-end Turing prediction task is constant in
entropy, the inner resources have to contain the
same amount of entropy-reducing ‘power’? This is
not necessarily true. But should one resource
contain significantly more than the other, it
appears that the outside engine does not actually
use the surplus.
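
The two scenarios can be made concrete with a toy part-of-speech tagger. The sketch below is purely illustrative: the vocabulary, the rules, and the function names are hypothetical. It shows only that two systems with very different internal resources can be indistinguishable to any measure computed from input/output pairs.

```python
# Case 1: a large lookup table (resource-heavy, trivial engine).
LOOKUP = {
    "dogs": "NNS", "bones": "NNS", "cats": "NNS",
    "eats": "VBZ", "runs": "VBZ", "dog": "NN", "bone": "NN",
}

def table_tagger(word):
    """Answer by direct lookup; all the work went into building LOOKUP."""
    return LOOKUP[word]

# Case 2: a tiny rule set (small resource, more work in the engine).
VERBS = {"eat", "run"}

def rule_tagger(word):
    """Answer by applying suffix rules on top of a two-entry verb list."""
    if word.endswith("s") and word[:-1] in VERBS:
        return "VBZ"
    if word.endswith("s"):
        return "NNS"
    return "NN"

words = list(LOOKUP)
table_out = [table_tagger(w) for w in words]
rule_out = [rule_tagger(w) for w in words]

# Identical input/output behaviour implies identical end-to-end statistics,
# even though the two resources differ in size.
same_behaviour = table_out == rule_out
resource_sizes = (len(LOOKUP), len(VERBS))
```

Any end-to-end entropy estimate built from `words` and the produced tags is the same for both taggers, while `resource_sizes` differs, which is exactly the information the outside measure cannot see.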

The Model: Before discussing the disadvantages
of the recurrent neural models applied here, it
is noteworthy that there is an alternative to a
universal and unified model (we call it the
unified model for short) for the framework. One
can instead find the best informants (we call
such a strategy best models) from various places
and ask them to perform the transformation
predictions. Alternatively, one can exhaust all
combinations of models, algorithms, and features,
and report the best results (smallest value of
entropy) as the complexity comparison. Though all
these strategies have pros and cons, we postulate
that unified models might be more suitable than
best models, as different informants might have
different levels of education.

To meet the two requirements described in
Section 6, we adopted recurrent neural models.
Recurrent models are by no means perfect: they
inevitably forget previous information and have
great difficulty capturing long-term dependencies
(Bengio et al., 1994). This becomes especially
problematic in tasks where long-term dependencies
play a vital role, such as discourse parsing.
Without trying to defend the model too far, we
note that recurrent models seem to offer
advantages over other current models that we can
think of, e.g., transition models. We are
optimistic that more sophisticated variations of
neural models or other models, such as LDCRFs
(Latent-Dynamic CRFs) (Morency et al., 2007) or
memory networks (Weston et al., 2015), will cope
with the aforementioned disadvantages bit by bit.
At the very least, one can replace recurrent
models if more suitable algorithms come up.


9 Conclusion: Toward a Theory of NLP 

Almost all NLP researchers today would agree
that there is no such thing as a theory of NLP.
We hope that in this paper we lay some
groundwork toward such a theory.

Any theory addresses some complex phenomenon by
(i) identifying some categories (of objects or
states or events) within it, (ii) providing some
characteristics and perhaps some definitions for
them, (iii) if possible describing some
relationships between them, and (iv) if possible
quantifying (some) aspects of these
relationships. A scientific theory measures
aspects of some phenomena and uses rules
expressing the relationships to predict the
values of other phenomena under certain
conditions.

The framework outlined in this paper names as
categories the commonly used linguistic phenomena
of NLP such as words, part-of-speech tags,
syntactic classes, and any other linguistically
motivated category that NLP researchers choose to
study. But it also has as categories various
algorithms and data structures and other aspects
of computation, including language models, the
notion of training data and evaluation against a
gold standard, classification, scoring, etc. The
General NLP Engine puts the notions together in a
single generic framework and suggests a way to
measure their separate individual characteristics
with regard to a single whole, namely the
performance of tasks phrased in a very generic
manner. This allows one to hold all but one
category constant and vary the characteristics of
either a linguistic or a computational category
and study its effect on the overall task relative
to any other variation, even if applied to some
other category.

It is of course possible to generalize the
General NLP Engine to apply to many other
application areas in Computer Science. However,
the domain of NLP has properties that make it
very attractive for fleshing out the nature of
the Engine and the general ‘theory’, among them
that NLP is a relatively mature domain within
Computer Science, being just over 60 years old;
that NLP addresses a very large and complex
subject field, namely natural language; and that
NLP uses a variety of quite different techniques,
including finite state transformation engines,
machine learning, etc., and numerous types of
representations, including vector spaces,
symbolic notations, and connectionist embeddings.

In summary, though far from perfect, this paper
provides a first attempt to quantify NLP tasks
under a uniform paradigm, which might have the
potential to significantly impact natural language
processing areas.
10 Appendix 


The Long Short-Term Memory (LSTM) model, first
proposed in (Hochreiter and Schmidhuber, 1997),
maps an input sequence to a fixed-sized vector
by sequentially combining the current
representation with the output representation of
the previous step. LSTM associates each time step
with an input gate, a forget gate, and an output
gate, and tries to minimize the impact of
unrelated information. Letting $i_t$, $f_t$ and
$o_t$ correspond to the gate states at time $t$,
$e_{t-1}$ and $e_t$ denote the output
representations at times $t-1$ and $t$, and
$e_{x_t}$ denote the embedding associated with
the token at time $t$, as defined in (Hochreiter
and Schmidhuber, 1997), we have

$$
\begin{aligned}
i_t &= \sigma(W_i \cdot e_{x_t} + V_i \cdot e_{t-1}) \\
f_t &= \sigma(W_f \cdot e_{x_t} + V_f \cdot e_{t-1}) \\
o_t &= \sigma(W_o \cdot e_{x_t} + V_o \cdot e_{t-1}) \\
l_t &= \tanh(W_l \cdot e_{x_t} + V_l \cdot e_{t-1}) \\
m_t &= f_t \cdot m_{t-1} + i_t \cdot l_t \\
e_t &= o_t \cdot m_t
\end{aligned}
\qquad (12)
$$

where $\sigma$ denotes the sigmoid function.
$i_t$, $f_t$ and $o_t$ are scalars within the
range of [0, 1].
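
One step of the recurrence above can be written out directly. The sketch below is a plain-Python illustration rather than the implementation used in the experiments. It assumes that the gates are applied elementwise to vectors, and it uses made-up 2-dimensional weights that are shared across all gates for brevity. The names `lstm_step`, `matvec` and `params` are hypothetical.

```python
import math

def sigmoid(z):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lstm_step(params, e_x, e_prev, m_prev):
    """One step of the recurrence; params maps 'i', 'f', 'o', 'l' to (W, V)."""
    def pre(name):
        # Pre-activation W * e_x + V * e_prev for the named gate.
        W, V = params[name]
        return [a + b for a, b in zip(matvec(W, e_x), matvec(V, e_prev))]

    i = [sigmoid(z) for z in pre("i")]
    f = [sigmoid(z) for z in pre("f")]
    o = [sigmoid(z) for z in pre("o")]
    l = [math.tanh(z) for z in pre("l")]
    # Memory update and output, applied elementwise (an assumption).
    m = [ft * mp + it * lt for ft, mp, it, lt in zip(f, m_prev, i, l)]
    e = [ot * mt for ot, mt in zip(o, m)]
    return e, m

# Toy example with identical, made-up weights for every gate.
W = [[0.1, -0.2], [0.3, 0.4]]
V = [[0.0, 0.1], [-0.1, 0.2]]
params = {name: (W, V) for name in ("i", "f", "o", "l")}
e_t, m_t = lstm_step(params, e_x=[1.0, 0.5], e_prev=[0.0, 0.0],
                     m_prev=[0.0, 0.0])
```

Calling `lstm_step` repeatedly, feeding each returned pair back in as `e_prev` and `m_prev`, yields the fixed-sized representation of a whole input sequence.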






